First Assessment - ArcMap and RStudio: A Comparative Report

Report

590 words

This review is based on the maps produced by ArcMap and RStudio for the first GIS assignment. First we present the workflow followed to build the maps, then we describe the process with each tool, and finally we conclude that both tools have their own advantages and disadvantages; which one is suitable for the purpose depends on the individual's knowledge and resources.

The longest part of the process of building this map was choosing the data. The data was obtained from the open-source CEPAL (2017) dataset of surveys in Latin America, covering different segments of the population classified by gender, educational level, and country, with counts of how many individuals had the intention to emigrate. We chose to present the information for all the Latin American countries where the survey was conducted, focusing on the higher educational level for both genders. The data was exported as a CSV file and cleaned of extra titles and non-alphanumeric characters.
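The cleaning step can be sketched in base R; the labels below are hypothetical, not the actual CEPAL fields:

```r
# Hypothetical survey labels containing stray non-alphanumeric characters
raw <- c("Higher Ed.*", "Secondary#", " Primary ")

# Keep only letters, digits, and spaces, then trim leading/trailing blanks
clean <- trimws(gsub("[^[:alnum:] ]", "", raw))
clean  # "Higher Ed" "Secondary" "Primary"
```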

First, with ArcMap we used the adminCountries basemap from Esri and merged it with the prepared CSV file. We created several maps trying to show the evolution of the quantities over the years. Even though we had a 10-year period, after several attempts we decided to present only 3 years. The process was easy, and the main difficulty was hoping not to run out of resources. Since the main objective was to present the information over time, the Time option was used, but it was later removed.

It presented an animation of the colors changing according to the quantities per year. Although it worked, it also hung very consistently, and the configuration lacked the flexibility to change the time window successfully, possibly due to the lack of resources.

Building a map with RStudio was very straightforward. The class material is very clear and helped to build the first map quickly. The first problem appeared when importing the data file: the columns, even though numerical, were always imported as characters, preventing numerical operations. We changed to another import command and it finally worked. Building the map was fast, and even character-to-number conversion is easy. RStudio does not require as many resources as ArcMap, but it does require knowledge of the R language. There are many tutorials on the Internet, although the material given in class was basically enough. Some of the problems, such as formatting, types, and other libraries, could be solved with each library's documentation. On the other hand, the tmap library offers an interactive map, but mostly not in the way we intended: the interaction is based on zooming the map, focusing on other coordinates, or changing the basemap. Fluently changing the information over a time frame could not be achieved at this time.
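The character-to-number conversion mentioned above can be sketched with base R's as.numeric; the column name and values are illustrative:

```r
# A column that arrived as character after import (illustrative values)
YEAR2015 <- c("12.5", "7.3", "20.1")
is.character(YEAR2015)            # TRUE: numerical operations would fail

YEAR2015 <- as.numeric(YEAR2015)  # convert before mapping
mean(YEAR2015)                    # 13.3
```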

In summary, both tools have their own advantages. The individual's use case, knowledge, and available resources will decide which one is more suitable for the purpose. We should note that RStudio is an open-source tool, in contrast to ArcMap. Maps can be built with far fewer resources than in ArcMap and in a much shorter time frame. Nevertheless, the resource needed is knowledge of the programming language and of the available libraries to work with.

A. Maps with RStudio

First, we need to set JAVA_HOME and load some libraries:

library("sf")
library("tmap")

The information seems to be OK; now we plot the first map:

tmap_mode("plot")
tm_shape(WBCOPER2) + tm_polygons("YEAR2015")

Now we are going to add more data to compare different years.

tm_shape(WBCOPER2) + tm_fill(col=c("YEAR2005","YEAR2010","YEAR2015"), title=c('% Higher Education Emigrate Intention 2005','2010','2015')) + 
  tm_borders()
The maps present the right information, but some adjustments are needed: labels are overlapping the maps and we need to add a general title.

Final Basic Map



B. Maps with ArcMap

  >The layers

Layers in ArcMap


 

Symbology

   

Final Map

 

 

Second Assessment

 

Report

560 words

 

The second assessment comprises a set of 6 tasks to be solved with the tools of our choice. Five files were given to perform the analysis.

  • The Hunt (Hunt7): the route travelled, with all the places visited by team 7 last year (2017).

  • The Hunt Addresses (HuntA): all the places that should be visited, with geographic references. At the beginning, an R routine was provided to extract the geographic coordinates from Google Maps. The routine was executed, but an API-key problem arose that prevented the routine from returning data. Due to this inconvenience, an alternative file was provided with the geographic information.

  • The hunt score (HuntScore): this file contains all the places that should be visited, with the scores or points obtained when the team achieved the goal.

  • The City of London wards: the shapefile atlas with the City of London merged in. This same file was used in the class lab.

  • TFL Tube Stations

We decided to present the solution with RStudio; a parallel analysis was made with ArcMap to corroborate the accuracy of the solutions.

Firstly, since no detailed instructions about the Treasure Hunt were given, some assumptions were made:

An outlier was found in HuntAddresses; we assumed it was a mistake, since the coordinates did not correspond to the place. The coordinates were corrected with Google Maps.

The score file shows the points obtained when the team visited the assigned places, and a bonus point was awarded when they visited a special category. This is the assumption of how the points were calculated for this exercise.
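That assumption can be written as a small sketch; the point values and bonus size here are made up, not the real HuntScore data:

```r
# Hypothetical scores: base points per visited place plus a category bonus
points  <- c(10, 20, 15)          # points for the places visited
special <- c(FALSE, TRUE, TRUE)   # whether each place is a special category
bonus   <- 1                      # assumed bonus per special place

total <- sum(points) + bonus * sum(special)
total  # 47
```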

Most of the libraries and the code used to solve this assessment were studied in class. Nevertheless, additional resources were consulted to solve some of the tasks.

One of the libraries used, Matrix, allowed us to calculate all the distances between the points visited, but it requires a longer time to calculate. The advantage is that all pairwise distances between the points become available.

Buffers were necessary for some tasks. These were created to intersect the data with the buffer, since some of the intersections were made between spatial points and spatial lines. To make those intersections succeed, a small buffer was created to obtain a resulting set.

Besides the data preparation, the most important point was to understand how significant projections are. Some of the results varied in quantity or units according to the projection used. In addition, some libraries required projected data, while others worked well with geodesic coordinates.

The last task requires spatial statistics. It is relevant to notice that randomness and clustering also have a scale component. If we analyze the information from a world perspective, all the points are clustered. But as we increase the scale (1:1000 -> 1:100), randomness becomes more evident and small clusters appear.

In summary, as in the first assessment, a clear understanding of the concepts reviewed in class was necessary to solve this assignment, along with knowledge of the libraries and how to use them to build code. As a drawback, the code in this assignment was not optimized; this can be done in the future.

 

To obtain the data:

 

hunt7 <- geojson_read("https://www.dropbox.com/s/wa2ip35tcmt93g3/Team7.geojson?raw=1", what = "sp")

Let's write the original file for backup:

writeOGR(hunt7, ".", "huntoriginal", driver="ESRI Shapefile")
huntaddresses <- read_csv("https://www.dropbox.com/s/v66l4cx7aia9jlo/huntLocations.csv?raw=1")

## alternatively, from the local copy:
huntaddresses <- read_csv("huntaddresses.csv")

 

1. How far did you travel (according to the trace they recorded)? We can all calculate Google travel distance using a routing service, so do not do this.

Two solutions were found: a) merge hunt7 into one line and calculate its length; b) build a distance matrix and derive the route length by applying matrix functions.

 

Libraries required

library(geojsonio) # to read geojson
library(rgdal)     # to write shapefiles
library(Matrix)    # matrix operations (triu/tril)
library(rgeos)     # line length (gLength)
library(geosphere) # geodesic distances (distm, distGeo)
library(sf)        # simple features conversion

Solution –> a) Merge & Length. We use the gLineMerge function to get one line, project the resulting spatial line, and calculate the length with rgeos::gLength.

huntoneline <- gLineMerge(hunt7)
huntonelinep <- spTransform(huntoneline, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
distoneline<-rgeos::gLength(huntonelinep, byid = FALSE)

> distoneline = 47213.58 (m)

 

Solution –> b) Distance matrix. We calculate the distance matrix, which holds the distances between all pairs of points. Because we are interested in the route, we zero out everything except the distances between consecutive points and sum the resulting superdiagonal. This solution makes all pairwise distances available, but the calculation is time- and resource-consuming compared with solution a.

t <- geometry(hunt7)
coordenada <- t[1,]@lines[[1]]@Lines[[1]]
temp <- as.data.frame(coordenada@coords)
distancia <- distm(temp, fun = distGeo)   # full pairwise distance matrix (metres)
d2 <- triu(distancia, k = 0, diag = FALSE)
d2[1:5, 1:5]
l <- tril(distancia, k = 1, diag = TRUE)  # keep up to the first superdiagonal
l <- triu(l, k = 1)                       # keep only consecutive-point distances
l[1:5, 1:5]
l2 <- sum(l)
l2 / 1000                                 # route length in km
Distance with Matrix


 

2. How many TfL stations did your route pass within 100 metres distance?

Libraries required


library(geojsonio) # to read geojson
library(rgdal)     # to write shapefiles
library(rgeos)     # buffers (gBuffer)
library(sf)        # st_as_sf
library(tmaptools) # read_shape

Solution –> We convert hunt7 and the tube stations to spatial objects, then create a 100 m buffer and project the resulting spatial objects; lastly we intersect the hunt7 buffer with the tube stations.

hunt7 <- read_shape("hunt.shp",as.sf = TRUE)
class(hunt7)
tubestations <- read_shape("tubestations.shp",as.sf = TRUE)
hunt7sf <- st_as_sf(hunt7)
hunt7SP <- as(hunt7sf, "Spatial")
writeOGR(hunt7SP, ".", "hunt1.shp", driver="ESRI Shapefile") 
tubestationsp <- as(tubestations, "Spatial")

######   Project

hunt7p <- spTransform(hunt7SP, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
tubestationsp<-spTransform(tubestationsp, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
bufferhunt7<-gBuffer(hunt7p, width = 100)

#######    Intersect

Tubestations100 <- tubestationsp[bufferhunt7, ]
plot(Tubestations100)    # test that the clip succeeded
summary(Tubestations100)
Tube stations within 100 m of Hunt7


   

3. How many points did you score based on the treasure hunt locations they managed to get within 300 metres of? Download the csv file of the locations and scores from Moodle.

Solution: Total points = (points by location) + bonus points, where the bonus depends on the category of the location.


> Libraries required

library(sp)
library(sf)
library(rgeos)
library(tidyverse)
require(rgdal)


###### Read the files
## Convert to SPATIAL
huntsf <- st_as_sf(hunt)
huntSP <- as(huntsf, "Spatial")

coordinates(huntscore) <- cbind(huntscore$lon , huntscore$lat)
huntscoredotssf <- st_as_sf(huntscore)  # the Spatial points convert directly
huntscoredotsSP <- as(huntscoredotssf, "Spatial")

################# Project

huntp <- spTransform(huntSP, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
proj4string(huntscoredotsSP) = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
huntscoredotsp<-spTransform(huntscoredotsSP, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
################################################
#### Create buffer
bufferhunt<-gBuffer(huntp, width = 300)
##################################################################
Placesprice <- huntscoredotsp[bufferhunt, ]
##########################
Placesprocesf <- st_as_sf(Placesprice)
####### Change column name
colnames(Placesprocesf)[which(names(Placesprocesf) == "Points.y")] <- "puntos"

#### Points Calculation
### The score file was prepared by adding a bonus column, set according to the points in the original file.

totalpuntos<-dplyr::select(as.data.frame(Placesprocesf), -geometry)
totalpuntos <- totalpuntos[,-(1:4)]
totalpuntos <- totalpuntos[,-(3:6)]
totalpuntos <- totalpuntos[,-(4:6)]

sum(totalpuntos)

Total points [1] 420


4. Which Wards did you pass through that had (a) the lowest and (b) the highest rates of Male Life Expectancy?


> Solution –> We convert hunt7 and LondonData to spatial objects, next we project the resulting spatial objects and intersect them, and lastly we calculate the max and the min of the required data.

Libraries required


library(sp)
library(sf)
library(rgdal)
library(tmaptools)## data frame operations
############ Read Data and transform to spatial

LondonData <- read_shape("LondonData.shp")
class(LondonData)
LondonDatasf <- st_as_sf(LondonData)
LondonDataSP <- as(LondonDatasf, "Spatial")
class(LondonDataSP)
hunt7 <- read_shape("hunt1.shp",as.sf = TRUE)
hunt7sp<-as(hunt7, "Spatial")

########### Project data

cr1<-CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")
LondonDatap <- spTransform(LondonDataSP, CRS=cr1)
hunt7p <- spTransform(hunt7sp, CRS=cr1)

#### Intersect london data with hunt7

LondonDataI<-LondonDatap[hunt7p, ]
plot(LondonDataI)

##### Find max, min

maxmale<-which.max(LondonDataI$"MaleLE0509")
Minmale<-which.min(LondonDataI$"MaleLE0509")

MaxBorough<-LondonDataI[maxmale,"WD11NM"]
MinBorough<-LondonDataI[Minmale,"WD11NM"]
MaxBorough
MinBorough

LondonDataTASK4<-LondonDataI[c(maxmale,Minmale),]
plot(LondonDataTASK4, col="yellow") 

  > (a) lowest and (b) the highest rates of Male Life Expectancy

   

5. Taking the average of all Wards that you passed through, what was the average life expectancy at birth for babies born in those wards along the whole route? This can be both Male and Female life expectancies.

Solution – We use the LondonData from class, intersect the data with the Hunt7 trajectory, and calculate the mean of the resulting set.


######### Data read and transform spatial
LondonData <- read_shape("LondonData.shp")
hunt <- read_shape("hunt1.shp",as.sf = TRUE)
LondonDataF<-data.frame(LondonData)
names(LondonDataF)
LondonDatasp <- as(LondonData, "Spatial")
plot(LondonDatasp)
huntsp<-as(hunt, "Spatial")

#######Projection

LondonDatap <- spTransform(LondonDatasp, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))
huntp <- spTransform(huntsp, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))

####Intersect

ResultsParcial<-LondonDatap[huntp, ]

########## Data select
MeanAge<-ResultsParcial[c("MaleLE0509","FemaleLE05")]

########## Calculate mean
Results<-data.frame(MeanAge)
Results
sapply(Results,mean)

plot(ResultsParcial,col="green")
plot(huntp, add = TRUE, col = 'red')

 

Average life expectancy at birth for babies born in those wards along the whole route


 

6. Are there any spatial patterns for the CASA Treasure Hunt locations, or are they randomly distributed?

Solution -> The Hunt Addresses file has an outlier at the point Platform 9 & 3/4, Kings Cross, London. After corroborating the coordinates, an error was found, so the coordinates were corrected in the file. First, to evaluate the data, we converted HuntAddresses to points and computed Ripley's K to evaluate whether the data are dispersed, clustered, or randomly distributed (following a Poisson distribution, for example). The Ripley's K plot pointed to possible clustering, so we estimated the number of clusters and ran the clustering, plotting the results. Because the information is clustered and apparently does not follow a random distribution, we can conclude that the pattern is apparently not random.

Libraries required

library(sp)
library(sf)
library(spatstat)
library(tmap)
library(cluster)
library(tidyverse)  ## read csv

 


############# Read , convert to spatial and project the data

huntaddresses <- read_csv("huntaddresscorr.csv")
coordinates(huntaddresses) <- cbind(huntaddresses$lon , huntaddresses$lat)
rango<-cbind(huntaddresses$lon , huntaddresses$lat)
proj4string(huntaddresses) = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
huntaddressesp <- spTransform(huntaddresses, CRS=CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"))

##############
############## Plot the points on the map
library(leaflet)
tmap_mode("view")
tm_shape(huntaddresses) + tm_dots(title = "Hunt", border.col = "green", border.lwd = 0.1, border.alpha = 0.2, col = "red",
                                  style = "quantile", palette = "Reds")
Hunt Addresses in London

Hunt Addresses in London

############## Convert the coordinates to points
Huntpattern <- ppp(rango[,1], rango[,2], c(-0.5,0.5), c(51,52))
############
# Importing Point data
#############
x_min<-min(huntaddresses$lon)
x_max<-max(huntaddresses$lon)
y_min<-min(huntaddresses$lat)
y_max<-max(huntaddresses$lat)

mypattern<-ppp(huntaddresses$lon,huntaddresses$lat, c(x_min,x_max), c(y_min,y_max)) 
plot(huntaddresses$lon, huntaddresses$lat, xlab='t', ylab='t-1',col='blue')

######
###### Ripley's K
#####
K.hunt <- Kest(mypattern, correction="best")
plot(K.hunt, main="Ripley's K function for Treasure Hunt Addresses")
Ripley’s K


######### Determine the number of clusters (elbow method, within-groups SS)
mydata <- rango
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, 
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
Number of clusters (elbow plot)


 

## PAM (k-medoids) cluster analysis
pamClus <- pam(rango, 3) 
plot(pamClus)
summary(pamClus)
Clusters


Third Assessment - Mini Research Proposal

Report

1600 words

To what extent is the Readiness for Business ODB indicator influenced by Implementation Innovation Scaled, Implementation Social Scaled, Implementation Accountability Scaled, and Implementation Datasets Average in the Open Data Barometer?

At the beginning, a proposal was made to evaluate the relation between coral reef fauna and coral reef bleaching. In spite of the efforts, we were unsuccessful in obtaining the data for the Caribbean reef, so an analysis of the Samoa reef was attempted instead. The information obtained for the fishes corresponded to American Samoa. Bathymetry for Samoa was obtained, but the relation between coral reef fauna and coral reef bleaching could not be analyzed by linear regression, due to the good health of the coral reef. A biodiversity analysis is possible; it can be calculated with a simple ratio. Given the complexity of this assessment's requirements, we decided to do a different analysis.

After researching and realizing that most countries have little or no open data, an article from the United Nations about the Open Data Barometer (ODB) ranking was found. This ranking represents the status of open data in different countries all over the world and evaluates their progress in delivering open data. The ODB can be applied to countries or cities. The Open Data Charter marks the commitment to deliver open data, and as of 2016 only 1 in 9 countries had open data. Sponsored by the World Wide Web Foundation and the World Bank, among others, the ODB principles include open data for everyone, data that is easy to use, and data that covers what people need. The ranking is based on several indicators grouped into readiness, implementation, and impact.

Not all countries are surveyed by the ODB. In order to be included, the country or city needs to adopt the Open Data Charter (ODC), a collaboration between governments and open-data experts; it is an independent program financed by The Fund for the City of New York. The ODC also has as sponsors the World Bank Group, the United Nations, the Organization of American States (OAS), the Organisation for Economic Co-operation and Development (OECD), the Inter-American Development Bank (IADB), the African Development Bank Group, and the Economic Commission for Latin America and the Caribbean (ECLAC/CEPAL).

Readiness refers to the capacity to deliver open data in the short term. This applies not only to data generated by the government but also by civil society and the private sector. The ODB is strengthened by policies and regulations; implementing them promotes the initiative, which can become obsolete if they are forgotten or never implemented. The impact that open data has on entrepreneurship, innovation, accountability, and social responsibility can be seen in countries like Uruguay with the application "A tu servicio" (https://www.datauy.org/portfolio/a-tu-servicio/), Japan with citizens' awareness of their electricity consumption (http://www.tepco.co.jp/forecast/index-j.html), and France with more than 200 entrepreneurship projects. It is important to remember that all these indicators relate to the degree of impact, readiness, or implementation of open data.

Since the economy is driven by business, this report will analyze the relation between the indicator Readiness for Business and 5 more indicators: Readiness for Policies, Implementation Innovation Scaled, Implementation Social Scaled, Implementation Accountability Scaled, and Implementation Datasets Average. The expected outcome is to promote initiatives that improve open data participation in order to encourage business initiatives.

The analysis will be performed in RStudio, and an additional Shiny application will be delivered, providing some flexibility for the same analysis.

Hypothesis

Null hypothesis: there is no evidence of influence between the Readiness for Business indicator and the rest of the indicators analyzed.

The null hypothesis will be rejected if the p-value < 0.05.
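The decision rule can be sketched with base R's lm on synthetic data; this illustrates the test itself, not the ODB dataset:

```r
# Synthetic data with a strong linear relation built in
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.5)

fit  <- lm(y ~ x)
pval <- summary(fit)$coefficients["x", "Pr(>|t|)"]
pval < 0.05   # TRUE here: the null hypothesis of no influence is rejected
```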

Data

The Open Data Barometer allows downloading the data in CSV format. This report uses the ODB data set of 2016.

 

 

Methodology

  • Firstly, we will use Pearson correlation to determine the correlation with the indicator Readiness for Business (dependent variable). Correlation is strong between Readiness for Business and Innovation and Datasets, and slightly weaker with the other indicators.

  • Next, a boxplot analysis was performed, which showed that countries with a high Readiness for Business indicator form a group of outliers in the graphic. These outliers are countries with a higher ODB ranking. A new dataset excluding the countries with a higher Readiness for Business ranking will be used.

  • A regression analysis will be performed to determine whether we can find a fitted model that explains the variation between the variables.

  • A Shiny application will be developed to add flexibility to the analysis.

Implementation

The data obtained for the analysis was not geographically referenced; to reference it, we joined the data with the Esri Admin World Countries map. The datasets were merged using the ISO_A3 country code.

The libraries and the code to perform the correlation, clustering, and linear regression were obtained from several websites, which can be found at the end of this report.

The Shiny application was developed following the tutorial at http://shiny.rstudio.com, which teaches the basic steps to build a Shiny application. Nevertheless, for a more efficient and modular app, more research is needed, because that was outside the scope of the site.

Different sites teach how to build a Shiny app, but the site selected was found to be simpler to understand and to replicate.

The Shiny app was built in one file, which includes the UI and the server logic. The UI loads the libraries and reads the files. It also establishes the look and feel of the app in a tab format. Each tab executes a different statistical analysis. The histogram, cluster, boxplot, and regression tabs are controlled by the list of indicators (the reactive input) and execute the analysis based on the Readiness for Business indicator (default) and the indicator selected from the list. The map plots the rank of the indicator selected in the list, while the Pearson correlation analysis is performed on all the indicators, regardless of the one selected. The server reacts to the UI. The implementation is actually simple and easy to build; it has some difficulties related to R itself, such as how and when to pass a quoted variable, but they were simple to overcome. The main drawback is that no optimization was done. This includes driving more than one block from a single reactive input: given the simple design of the application, all the tabs could be filled from just one reactive, so the code could be smaller and simpler.

Discussion

The linear regression shows a p-value greater than our 0.05 significance level for all the indicators, with the exception of Readiness for Policies Scaled and Implementation Datasets Average. The Readiness for Policies Scaled indicator measures the ability of the government, private sector, and citizens to generate open data as collaborators. It is also related to the capability of these sectors to read the data (data-literacy skills), to use the open data without restrictions, and to the economic resources needed to perform all of this: the generation of the data, the literacy skills, and the exploration of the data. Policies and legislation may have an impact on Readiness for Business, since data can be considered a right; while data privacy is legislated in most developed countries, developing countries often do not enforce this right. Therefore, policies and legislation can provide access to open data and in this way improve the opportunities to do business.

The other indicator, Implementation Datasets Average, reflects not only whether data is published, but also other factors such as machine-readiness (ready for automatic processing), whether it has any cost or is free of charge, whether it has an open-data license, and how often it is updated, among others. These two indicators are related; one could be a consequence of the other, but the correlation between them is still acceptable for a linear regression.

The Shiny application shows the relation between the Readiness for Business indicator and the other indicators involved. The application is simple, and the structure is easy to change. Even though this application only allows the analysis between two of the variables at a time, it can be modified to include a different dataset and additional variables.

An ArcMap analysis was also performed. It is interesting to observe that ArcMap clustering divides the countries into 2 clusters: the high-ranking countries and the rest. In our RStudio analysis we took the high-ranking group as outliers and performed the analysis without them.

Third Assessment Conclusion

In summary, there is an inverse relation between Readiness for Business and all the indicators, with the exception of Implementation Datasets Average and Readiness for Policies. Nevertheless, since the p-value is greater than 0.05 for all but two of the independent variables, we can only conclude that there may be a relation with Readiness for Business.

After finishing all the assessment exercises and evaluating the effort needed, it is important to write about concepts. A programming language is a medium of communication. Therefore, concepts about geography, coordinates, libraries, and projections open opportunities to apply all of them to real-life problems. The greatest drawback is the limited datasets available; after analyzing the ODB and considering that only the more developed countries rank highly for open data, more resources should be dedicated to developing open data for the benefit of developing countries.

RStudio

RStudio provides several libraries; after trying several, the following packages were chosen.

###### Libraries Required
#########################

library(sp)
library(sf)
library(spatstat)
library(cluster)
library(tidyverse)
library(pastecs)
library(corrplot)
library(maps)
library(tmap)
library(tmaptools) # read_shape

################## Data
countries <-read_shape("countries.shp")
countriesdf <-st_as_sf(countries)
cranking <- read_csv("ODBRankings6.csv")
scores<-cranking[c(9:17)]
scores<-scores[-c(2:3)]
scores<-scores[-c(3)]
names(scores) <- c("ReadyPolicies", "ReadyBusiness","ImpInnovation","ImpSocial","ImpAccountability","ImpDatasets")
stat.desc(scores)
MyPattern<-cranking[c("Readiness_Business_Scaled","Readiness_Policies_Scaled")]
######### Correlation
m<-cor(scores,method="pearson")
summary(m)
M<-cor(scores)
corrplot(M, method="circle")
corrplot(M, method="number")
corrplot(M, type="upper")
corrplot(M, type="upper", order="hclust")
######### 

 

 

################## box plot
boxplot(scores$ReadyBusiness,scores$ReadyPolicies,scores$ImpInnovation,scores$ImpSocial,scores$ImpAccountability,scores$ImpDatasets) 

 


> We observed several outliers, concentrated in the countries with the highest rankings. A new set was created with the countries with a ranking lower than 85.


scorenotout<-scores[scores$ReadyBusiness<85, ]
boxplot(scorenotout$ReadyBusiness,scorenotout$ReadyPolicies,scorenotout$ImpInnovation,scorenotout$ImpSocial,scorenotout$ImpAccountability,scorenotout$ImpDatasets) 


> After removing the outliers, we run a linear regression.

fit <- lm(ReadyBusiness ~ ReadyPolicies+ImpInnovation+ImpSocial+ImpAccountability+ImpDatasets, data=scorenotout)
summary(fit)

 

 

RStudio-Shiny

A Shiny application was developed to complement the linear regression and add flexibility. It is presented in a tab format.

  • First tab: map of the indicators
  • Second tab: indicator histogram
  • Third tab: indicator boxplot
  • Fourth tab: Pearson correlation for all the indicators
  • Fifth tab: linear regression per indicator
  • Sixth tab: cluster analysis per indicator

The code for the Shiny application is based on Shiny.rstudio.com (2019).


###Libraries
library(shiny)
library(leaflet)
library(sp)
library(sf)
library(dplyr)
library(tmaptools)
library(spatstat)
library(factoextra) ###Cluster plot
library(rgdal)
library(maps)
library(tmap)
library(tidyverse)
library(plotly)
library(RColorBrewer)
library(ggplot2)
library(mclust)
library(corrplot)
library(cluster)

###Reading the files

countries <-read_shape("countries.shp")
countriesdf <-st_as_sf(countries)
pal <- brewer.pal(10, "Set3")[c(10, 8, 4, 5)]
cranking <- read_csv("ODBRankings6.csv")
scores<-cranking[c(9:17)]
scores<-scores[-c(2:3)]
scores<-scores[-c(3)]
names(scores) <- c("ReadyPolicies", "ReadyBusiness","ImpInnovation","ImpSocial","ImpAccountability","ImpDatasets")

MyPattern<-countriesdf

# Define UI ----

ui <- fluidPage(
  titlePanel(" "),
    fluidRow(
            column(11,offset = 1, titlePanel("2016 Readiness for Business Analysis vs Other Indicators")) 
          ),
    fluidRow(
            column(6,offset = 1,
              selectInput(inputId = "Indicator",
                       label = "Select Indicator Type:", 
                       choices = c("Implementation Innovation Scaled", "Implementation Social Scaled",
                                      "Implementation Accountability Scaled", "Implementation Datasets Average","Population Rank")
              )
            )
          ),
  mainPanel(  
      tabsetPanel(
          tabPanel("map", leafletOutput("mapw", width = "100%", height = 400)), 
          tabPanel("Indicator Histogram",plotOutput(outputId = "HistPlot")), 
          tabPanel("Boxplot",plotOutput(outputId = "boxPlot")), 
          tabPanel("Pearson Correlation", plotOutput(outputId = "Pearplot")),
          tabPanel("Linear Regression", plotOutput(outputId = "Linearplot")),
          tabPanel("Cluster Analysis", plotOutput(outputId = "clusterplot"))
                 )
  )
)

# Define server logic ----

server <- function(input, output)
  {
  
        output$mapw <- renderLeaflet({
          
                  # Map the UI label to the matching shapefile column
                  x  <- switch(input$Indicator,
                   "Implementation Innovation Scaled" = "Implm_S",
                   "Implementation Social Scaled" = "Impct_S",
                   "Implementation Accountability Scaled" = "Imp_A_S",
                   "Implementation Datasets Average" = "Imp_D_A",
                   "Population Rank" = "POP_RAN")
                    mapw <- read_shape("countries.shp")  # re-read on every render
                    mapw <- tm_shape(mapw) + tm_polygons(x)
                    tmap_leaflet(mapw)
             })
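Since the shapefile does not change between renders, a leaner variant (a sketch, not the original code) reads it once at start-up and only rebuilds the tmap layer inside renderLeaflet:

```r
# Sketch: read countries.shp once, outside the render function,
# and reuse the object for every redraw.
worldShape <- read_shape("countries.shp")

output$mapw <- renderLeaflet({
  x <- switch(input$Indicator,
              "Implementation Innovation Scaled"      = "Implm_S",
              "Implementation Social Scaled"          = "Impct_S",
              "Implementation Accountability Scaled"  = "Imp_A_S",
              "Implementation Datasets Average"       = "Imp_D_A",
              "Population Rank"                       = "POP_RAN")
  tmap_leaflet(tm_shape(worldShape) + tm_polygons(x))
})
```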
        
  output$HistPlot <- renderPlot({
    
                  x  <- switch(input$Indicator,
                         "Implementation Innovation Scaled" = countries$Implm_S,
                         "Implementation Social Scaled" = countries$Impct_S,
                         "Implementation Accountability Scaled" = countries$Imp_A_S,
                         "Implementation Datasets Average" = countries$Imp_D_A,
                         "Population Rank" = countries$POP_RAN)
    
                 hist(x,breaks=15,xlab = input$Indicator,main = "World ODBRank",col="darkmagenta")
    
                 })
  output$boxPlot <- renderPlot({
    
    x  <- switch(input$Indicator,
                 "Implementation Innovation Scaled" = countries$Implm_S,
                 "Implementation Social Scaled" = countries$Impct_S,
                 "Implementation Accountability Scaled" = countries$Imp_A_S,
                 "Implementation Datasets Average" = countries$Imp_D_A,
                 "Population Rank" = countries$POP_RAN)

    boxplot(x, main=input$Indicator,col = "forestgreen")
                                            
    
  })
  
  output$Pearplot <- renderPlot({
    
             # scores is not reactive, so isolate() is not strictly needed here
             m <- isolate(cor(scores, method = "pearson"))
             corrplot(m, method = "number")
           
  })
  
  output$Linearplot <- renderPlot({
    w  <- switch(input$Indicator,
                 "Implementation Innovation Scaled" = countriesdf$Implm_S,
                 "Implementation Social Scaled" = countriesdf$Impct_S,
                 "Implementation Accountability Scaled" = countriesdf$Imp_A_S,
                 "Implementation Datasets Average" = countriesdf$Imp_D_A,
                 "Population Rank" = countriesdf$POP_RAN)
        # Helper: scatter plot with the fitted line and model stats in the title
        ggplotRegression <- function (fit) {
        ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
        geom_point() +
        stat_smooth(method = "lm", col = "red") +
        labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
                           "Intercept =",signif(fit$coef[[1]],5 ),
                           " Slope =",signif(fit$coef[[2]], 5),
                           " P =",signif(summary(fit)$coef[2,4], 5)))
    }
    # Readiness for Business (Rdn_B_S) regressed on the selected indicator
    ggplotRegression(lm(Rdn_B_S ~ w, data = countriesdf))

  })
   output$clusterplot <- renderPlot({

     w  <- switch(input$Indicator,
                  "Implementation Innovation Scaled" = 3,
                  "Implementation Social Scaled" = 4,
                  "Implementation Accountability Scaled" = 5,
                  "Implementation Datasets Average" = 6,
                  "Population Rank" = 1)
     
                       # Pair the selected indicator with ReadyBusiness (column 2)
                       MyPattern <- scores[c(w, 2)]
                       k2 <- kmeans(MyPattern, centers = 3, nstart = 25)
                       fviz_cluster(k2, data = MyPattern, main = "Cluster plot",
                                    xlab = input$Indicator,
                                    ylab = "Readiness for Business")
     })
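The number of clusters (centers = 3) is fixed in the code above; factoextra, which is already loaded, can help justify that choice with an elbow plot. A minimal sketch, assuming the same scores data frame:

```r
# Sketch: elbow method for choosing k on the same indicator data.
# fviz_nbclust() plots total within-cluster sum of squares for a range of k;
# the "elbow" in the curve suggests a reasonable number of clusters.
fviz_nbclust(scores, kmeans, method = "wss")
```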
  
  }


# Run the app ----
shinyApp(ui = ui, server = server)
  • Here, the bathymetry map. Even though that analysis was incomplete and unrelated, the knowledge acquired about how difficult it is to build a map from scratch is invaluable.

Bibliography